Abstract

We propose a first-order method for stochastic strongly convex optimization that attains an $O(1/n)$ rate of convergence. Analysis
shows that the proposed method is simple, easy to implement, and, in the worst case, asymptotically four times faster than its peers.
We derive this method from several intuitive observations that are generalized from existing first-order optimization methods.



I. Problem Setting 

In this article we seek a numerical algorithm that iteratively approximates the solution $w^*$ of the following strongly convex
optimization problem:

$$w^* = \arg\min f(\cdot) \tag{1}$$

where $f(\cdot) : \Gamma_f \to \mathbb{R}$ is an unknown, not necessarily smooth, multivariate and $\lambda$-strongly convex function, with $\Gamma_f$ its convex
definition domain. The algorithm is not allowed to accurately sample $f(\cdot)$ by any means since $f(\cdot)$ itself is unknown. Instead
the algorithm can call stochastic oracles $\tilde{\omega}(\cdot)$ at chosen points $x_1, \ldots, x_n$, which are unbiased and independent probabilistic
estimators of the first-order local information of $f(\cdot)$ in the vicinity of each $x_i$:

$$\tilde{\omega}(x_i) = \{ f_i(x_i), \nabla f_i(x_i) \} \tag{2}$$

where $\nabla$ denotes the random subgradient operator and $f_i(\cdot) : \Gamma_f \to \mathbb{R}$ are independent and identically distributed (i.i.d.) functions
that satisfy:

$$\text{(unbiased)} \quad \mathbb{E}[f_i(\cdot)] = f(\cdot) \quad \forall i \tag{3a}$$

$$\text{(i.i.d.)} \quad \mathrm{Cov}(f_i(\cdot), f_j(\cdot)) = 0 \quad \forall i \neq j \tag{3b}$$

Solvers for this kind of problem are in high demand among scientists in large-scale computational learning, in which the
first-order stochastic oracle is the only measurable information of $f(\cdot)$ that scales well with both the dimensionality and the scale of the
learning problem. For example, a stochastic first-order oracle in structural risk minimization (a.k.a. training a support vector
machine) can be readily obtained in $O(1)$ time |JJ.
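To make the oracle model concrete, the following is a minimal Python sketch of a stochastic first-order oracle for an L2-regularized hinge loss. The function name `svm_oracle`, the toy data set and the value of the regularization constant are illustrative assumptions, not part of the original text. Each call touches a single training example, so its cost does not grow with the number of examples (it is still linear in the dimension).

```python
import random

def svm_oracle(w, data, lam, rng):
    """Return (f_i(w), subgradient of f_i at w) for one uniformly sampled example.

    Assumed form: f_i(w) = lam/2 * ||w||^2 + max(0, 1 - y * <w, x>), so that
    E[f_i] is the regularized empirical risk and every f_i is lam-strongly convex.
    """
    x, y = data[rng.randrange(len(data))]            # one sample per call
    margin = y * sum(wj * xj for wj, xj in zip(w, x))
    value = 0.5 * lam * sum(wj * wj for wj in w) + max(0.0, 1.0 - margin)
    grad = [lam * wj for wj in w]                    # gradient of the regularizer
    if margin < 1.0:                                 # hinge term is active: add -y * x
        grad = [gj - y * xj for gj, xj in zip(grad, x)]
    return value, grad

# Hypothetical toy data: (feature vector, label in {-1, +1})
data = [([1.0, 2.0], 1), ([-1.0, -1.5], -1), ([0.5, -0.5], -1)]
value, grad = svm_oracle([0.0, 0.0], data, lam=0.1, rng=random.Random(0))
```

At $w = 0$ every example has zero margin, so the returned value is 1 and the subgradient is $-y x$ for whichever example was drawn.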

II. Algorithm 

The proposed algorithm itself is quite simple, although its proof of convergence is more involved. The only change compared to
SGD is the selection of the step size in each iteration, which nevertheless results in a substantial boost in performance, as will be shown
in the next section.
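The following Python sketch is one possible reading of Algorithm 1, not a definitive implementation. It assumes the domain is all of $\mathbb{R}^d$ (no constraint set), and that the step weights are $u_{i-1}/2$ with $u_i = u_{i-1} - u_{i-1}^2/4$, as suggested by the analysis in Section III. Under these assumptions every aggregated prox-function is a $\lambda$-strongly convex quadratic, so its minimizer is the same convex combination of the individual minimizers $x_i - \nabla f_i(x_i)/\lambda$ and can be tracked with a single vector. The names `algorithm1` and `noisy_quadratic_oracle` and the test problem are hypothetical.

```python
import random

def algorithm1(oracle, x1, lam, n):
    """Run n oracle calls; oracle(x) returns (f_i(x), grad f_i(x)). Domain assumed R^d."""
    x, y, u = list(x1), list(x1), 1.0
    _, g = oracle(x)
    # z tracks argmin of the aggregated prox-function; for P_1 it is x_1 - g_1 / lam
    z = [xj - gj / lam for xj, gj in zip(x, g)]
    for _ in range(2, n + 1):
        a = u / 2.0                                  # step weight u_{i-1} / 2
        x = list(z)                                  # x_i = argmin P_{i-1}
        _, g = oracle(x)
        # mixing quadratics of equal curvature mixes their minimizers
        z = [(1 - a) * zj + a * (xj - gj / lam) for zj, xj, gj in zip(z, x, g)]
        y = [(1 - a) * yj + a * xj for yj, xj in zip(y, x)]
        u = u - u * u / 4.0
    return y

rng = random.Random(0)
lam, c = 1.0, [1.0, -2.0]

def noisy_quadratic_oracle(x):
    # f_i(x) = lam/2 * ||x - c||^2 + <noise_i, x>, so E[f_i] = f with minimizer c
    noise = [rng.gauss(0.0, 0.1) for _ in x]
    value = 0.5 * lam * sum((xj - cj) ** 2 for xj, cj in zip(x, c))
    value += sum(nj * xj for nj, xj in zip(noise, x))
    grad = [lam * (xj - cj) + nj for xj, cj, nj in zip(x, c, noise)]
    return value, grad

y_n = algorithm1(noisy_quadratic_oracle, [0.0, 0.0], lam, n=500)
```

Note that the sketch never needs the constant $G$: it only enters the analysis through the normalization of $u_i$.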

III. Analysis 

The proposed algorithm is designed to generate an output $y$ that reduces the suboptimality:

$$S(y) = f(y) - \min f(\cdot) \tag{4}$$

as fast as possible after a number of operations. We derive the algorithm from several intuitive observations that are generalized
from existing first-order methods. First, we start from worst-case upper bounds of $S(y)$ in deterministic programming:

Lemma 1. (Cutting-plane bound) Given $n$ deterministic oracles $\Omega_n = \{\omega(x_1), \ldots, \omega(x_n)\}$ defined by:

$$\omega(x_i) = \{ f(x_i), \nabla f(x_i) \} \tag{5}$$

If $f(\cdot)$ is a $\lambda$-strongly convex function, then $\min f(\cdot)$ is unimprovably lower bounded by:

$$\min f(\cdot) \geq \max_{i=1 \ldots n} p_i(w^*) \geq \min \max_{i=1 \ldots n} p_i(\cdot) \tag{6}$$
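Algorithm 1

Receive $x_1$, $\Gamma_f$, $\lambda$
$u_1 \leftarrow 1$, $y_1 \leftarrow x_1$
Receive $f_1(x_1)$, $\nabla f_1(x_1)$
$\tilde{P}_1(\cdot) \leftarrow f_1(x_1) + \langle \nabla f_1(x_1), \cdot - x_1 \rangle + \frac{\lambda}{2} \|\cdot - x_1\|^2$
for $i = 2, \ldots, n$ do
    $x_i \leftarrow \arg\min \tilde{P}_{i-1}(\cdot)$
    Receive $f_i(x_i)$, $\nabla f_i(x_i)$
    $\tilde{P}_i(\cdot) \leftarrow \left(1 - \frac{u_{i-1}}{2}\right) \tilde{P}_{i-1}(\cdot) + \frac{u_{i-1}}{2} \left\{ f_i(x_i) + \langle \nabla f_i(x_i), \cdot - x_i \rangle + \frac{\lambda}{2} \|\cdot - x_i\|^2 \right\}$
    $y_i \leftarrow \left(1 - \frac{u_{i-1}}{2}\right) y_{i-1} + \frac{u_{i-1}}{2} x_i$
    $u_i \leftarrow u_{i-1} - \frac{u_{i-1}^2}{4}$
end for
Output $y_n$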



where $w^*$ is the unknown minimizer defined by (1), and $p_i(\cdot) : \Gamma_f \to \mathbb{R}$ are proximity control functions (or simply
prox-functions) defined by $p_i(\cdot) = f(x_i) + \langle \nabla f(x_i), \cdot - x_i \rangle + \frac{\lambda}{2} \|\cdot - x_i\|^2$.

Proof:

By strong convexity of $f(\cdot)$ we have:

$$B_f(\cdot \,\|\, x_i) \geq \frac{\lambda}{2} \|\cdot - x_i\|^2$$

$$\Longrightarrow \quad f(\cdot) \geq \max_{i=1,\ldots,n} p_i(\cdot) \tag{7}$$

$$\Longrightarrow \quad \min f(\cdot) = f(w^*) \geq \max_{i=1,\ldots,n} p_i(w^*) \geq \min \max_{i=1,\ldots,n} p_i(\cdot) \tag{8}$$

where $B_f(x_1 \,\|\, x_2) = f(x_1) - f(x_2) - \langle \nabla f(x_2), x_1 - x_2 \rangle$ denotes the Bregman divergence between two points $x_1, x_2 \in \Gamma_f$.
Both sides of (7) and (8) become equal if $f(\cdot) = \max_{i=1 \ldots n} p_i(\cdot)$, so this bound cannot be improved without any extra
condition.

■ 
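As a numerical illustration of the cutting-plane bound, the sketch below builds the prox-functions for a one-dimensional, nonsmooth, strongly convex test function and checks on a grid that their pointwise maximum never exceeds the function. The test function, the query points and the grid are arbitrary choices made for this example only.

```python
LAM = 1.0

def f(x):
    # LAM-strongly convex (quadratic plus convex terms), nonsmooth at x = 1
    return 0.5 * LAM * x * x + abs(x - 1.0) + 0.25 * x ** 4

def subgrad(x):
    sign = 1.0 if x > 1.0 else -1.0 if x < 1.0 else 0.0
    return LAM * x + sign + x ** 3

def prox_functions(points):
    """Store (f(x_i), g_i, x_i), the data defining p_i(.) for each query point."""
    return [(f(xi), subgrad(xi), xi) for xi in points]

def lower_model(x, cuts):
    # max_i p_i(x) with p_i(x) = f(x_i) + g_i * (x - x_i) + LAM/2 * (x - x_i)^2
    return max(fx + g * (x - xi) + 0.5 * LAM * (x - xi) ** 2 for fx, g, xi in cuts)

cuts = prox_functions([-2.0, 0.0, 0.5, 3.0])
grid = [k / 100.0 for k in range(-300, 301)]
gap = min(f(x) - lower_model(x, cuts) for x in grid)      # nonnegative up to rounding
lower_bound = min(lower_model(x, cuts) for x in grid)     # grid estimate of min max p_i
min_f = min(f(x) for x in grid)                           # grid estimate of min f
```

On this grid `lower_bound` cannot exceed `min_f`, which is the content of (6); adding more query points tightens the model.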

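Lemma 2. (Jensen's inequality for strongly convex function) Given $n$ deterministic oracles $\Omega_n = \{\omega(x_1), \ldots, \omega(x_n)\}$ defined
by (5). If $f(\cdot)$ is a $\lambda$-strongly convex function, then for all $\alpha_1, \ldots, \alpha_n$ that satisfy $\sum_{i=1}^n \alpha_i = 1$, $\alpha_i \geq 0 \ \forall i$, $f(y)$ is
unimprovably upper bounded by:

$$f(y) \leq \sum_{i=1}^n \alpha_i f(x_i) - \frac{\lambda}{2} \sum_{i=1}^n \alpha_i \|x_i - y\|^2 \tag{9}$$

where $y = \sum_{i=1}^n \alpha_i x_i$.

Proof:

By strong convexity of $f(\cdot)$ we have:

$$f(y) \leq f(x_i) - \langle \nabla f(y), x_i - y \rangle - \frac{\lambda}{2} \|x_i - y\|^2 \quad \forall i$$

$$\Longrightarrow \quad f(y) \leq \sum_{i=1}^n \alpha_i f(x_i) - \Big\langle \nabla f(y), \sum_{i=1}^n \alpha_i x_i - y \Big\rangle - \frac{\lambda}{2} \sum_{i=1}^n \alpha_i \|x_i - y\|^2$$

$$= \sum_{i=1}^n \alpha_i f(x_i) - \frac{\lambda}{2} \sum_{i=1}^n \alpha_i \|x_i - y\|^2$$

Both sides of all the above inequalities become equal if $f(\cdot) = \frac{\lambda}{2} \|\cdot - y\|^2 + \langle c_1, \cdot \rangle + c_2$, where $c_1$ and $c_2$ are constants, so this
bound cannot be improved without any extra condition.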

■ 
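The strengthened Jensen inequality can be checked numerically. The sketch below evaluates the gap between the right and left sides for a nonsmooth strongly convex function, where the gap should be nonnegative, and for a pure quadratic with curvature $\lambda$, where the lemma predicts equality. The functions, the random points and the helper name `jensen_gap` are illustrative assumptions.

```python
import random

LAM = 2.0

def f(x):
    return 0.5 * LAM * x * x + abs(x)      # LAM-strongly convex, nonsmooth at 0

def jensen_gap(points, alphas, func):
    """Upper bound minus func(y), with y the alpha-weighted average of the points."""
    y = sum(a * x for a, x in zip(alphas, points))
    upper = sum(a * func(x) for a, x in zip(alphas, points))
    upper -= 0.5 * LAM * sum(a * (x - y) ** 2 for a, x in zip(alphas, points))
    return upper - func(y)

rng = random.Random(1)
points = [rng.uniform(-3.0, 3.0) for _ in range(6)]
raw = [rng.random() for _ in points]
alphas = [r / sum(raw) for r in raw]       # nonnegative weights summing to 1

gap_nonsmooth = jensen_gap(points, alphas, f)
gap_quadratic = jensen_gap(points, alphas, lambda x: 0.5 * LAM * (x - 1.0) ** 2 + 3.0 * x)
```

For the nonsmooth function the gap reduces to $\sum_i \alpha_i |x_i| - |y|$, which is the ordinary Jensen gap of the absolute value.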

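Immediately, the optimal $A$ that yields the lowest upper bound of $f(y)$ can be given by:

$$A = \arg\min_{\sum_{i=1}^n \alpha_i = 1,\ \alpha_i \geq 0 \ \forall i} \left\{ \sum_{i=1}^n \alpha_i f(x_i) - \frac{\lambda}{2} \sum_{i=1}^n \alpha_i \Big\| x_i - \sum_{j=1}^n \alpha_j x_j \Big\|^2 \right\} \tag{10}$$

Combining with (6), we have a deterministic upper bound of $S(y)$:

$$S(y) \leq \min_{\sum_{i=1}^n \alpha_i = 1,\ \alpha_i \geq 0 \ \forall i} \left\{ \sum_{i=1}^n \alpha_i f(x_i) - \frac{\lambda}{2} \sum_{i=1}^n \alpha_i \Big\| x_i - \sum_{j=1}^n \alpha_j x_j \Big\|^2 \right\} - \max_{i=1 \ldots n} p_i(w^*) \tag{11}$$

This bound is of little use at the moment, as we are only interested in bounds in stochastic programming. The next lemma
shows how it can be generalized to the latter case.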

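Lemma 3. Given $n$ stochastic oracles $\tilde{\Omega}_n = \{\tilde{\omega}(x_1), \ldots, \tilde{\omega}(x_n)\}$ defined by (2), if $y(\cdot, \ldots, \cdot)$, taking values in $\Gamma_f$, and $U(\cdot, \ldots, \cdot)$,
taking values in $\mathbb{R}$, are functionals of $f_i(\cdot)$ and $x_i$ that satisfy:

$$U(f, \ldots, f, x_1, \ldots, x_n) \geq S(y(f, \ldots, f, x_1, \ldots, x_n)) \tag{12a}$$

$$U(f_1, \ldots, f_n, x_1, \ldots, x_n) \text{ is convex w.r.t. } f_1, \ldots, f_n \tag{12b}$$

$$\mathbb{E}\big[\langle \nabla_{f, \ldots, f} U(f, \ldots, f, x_1, \ldots, x_n), [f_1 - f, \ldots, f_n - f]^T \rangle\big] \leq 0 \tag{12c}$$

then $\mathbb{E}[S(y(f, \ldots, f, x_1, \ldots, x_n))]$ is upper bounded by $U(f_1, \ldots, f_n, x_1, \ldots, x_n)$.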
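Proof:

Assuming that $\delta_i(\cdot) : \Gamma_f \to \mathbb{R}$ are perturbation functions defined by

$$\delta_i(\cdot) = f_i(\cdot) - f(\cdot) \tag{13}$$

we have:

$$U(f_{1,\ldots,n}, x_{1,\ldots,n}) = U(f + \delta_1, \ldots, f + \delta_n, x_{1,\ldots,n})$$

$$\text{(by (12b))} \quad \geq U(f, \ldots, f, x_{1,\ldots,n}) + \langle \nabla_{f,\ldots,f} U(f, \ldots, f, x_{1,\ldots,n}), [\delta_{1,\ldots,n}]^T \rangle$$

$$\text{(by (12a))} \quad \geq S(y(f, \ldots, f, x_{1,\ldots,n})) + \langle \nabla_{f,\ldots,f} U(f, \ldots, f, x_{1,\ldots,n}), [\delta_{1,\ldots,n}]^T \rangle \tag{14}$$

Moving $\delta_i$ to the left side:

$$\mathbb{E}[S(y(f, \ldots, f, x_{1,\ldots,n}))] \leq U(f_{1,\ldots,n}, x_{1,\ldots,n}) + \mathbb{E}\big[\langle \nabla_{f,\ldots,f} U(f, \ldots, f, x_{1,\ldots,n}), [\delta_{1,\ldots,n}]^T \rangle\big]$$

$$\text{(by (12c))} \quad \leq U(f_{1,\ldots,n}, x_{1,\ldots,n})$$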



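Clearly, according to (12b), setting:

$$U(f_{1,\ldots,n}, x_{1,\ldots,n}) = \min_{\sum_{i=1}^n \alpha_i = 1,\ \alpha_i \geq 0 \ \forall i} \left\{ \sum_{i=1}^n \alpha_i f_i(x_i) - \frac{\lambda}{2} \sum_{i=1}^n \alpha_i \Big\| x_i - \sum_{j=1}^n \alpha_j x_j \Big\|^2 \right\} - \max_{i=1 \ldots n} \tilde{p}_i(w^*) \tag{15}$$

by substituting $f(\cdot)$ and $p_i(\cdot)$ in (11) respectively with $f_i(\cdot)$ defined by (3) and $\tilde{p}_i(\cdot) : \Gamma_f \to \mathbb{R}$ defined by:

$$\tilde{p}_i(\cdot) = f_i(x_i) + \langle \nabla f_i(x_i), \cdot - x_i \rangle + \frac{\lambda}{2} \|\cdot - x_i\|^2 \tag{16}$$

is not an option, because $\min_{\sum \alpha_i = 1,\ \alpha_i \geq 0 \ \forall i}\{\cdot\}$ and $-\max_{i=1 \ldots n}\{\cdot\}$ are both concave, $\sum_{i=1}^n \alpha_i f_i(x_i)$ and $\tilde{p}_i(w^*)$ are both linear in
$f_i(\cdot)$, and $-\frac{\lambda}{2} \sum_{i=1}^n \alpha_i \| x_i - \sum_{j=1}^n \alpha_j x_j \|^2$ is irrelevant to $f_i(\cdot)$. This prevents asymptotically fast cutting-plane/bundle methods
El, JSl, im from being directly applied to stochastic oracles without any loss of performance. As a result, to decrease (4) and
satisfy (12b), our options boil down to replacing $\min_{\sum \alpha_i = 1,\ \alpha_i \geq 0 \ \forall i}\{\cdot\}$ and $-\max_{i=1 \ldots n}\{\cdot\}$ in (15) with their respective lowest
convex upper bounds: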



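$$\tilde{U}_{(A,B)}(f_{1,\ldots,n}, x_{1,\ldots,n}) = \sum_{i=1}^n \alpha_i f_i(x_i) - \frac{\lambda}{2} \sum_{i=1}^n \alpha_i \Big\| x_i - \sum_{j=1}^n \alpha_j x_j \Big\|^2 - \sum_{i=1}^n \beta_i \tilde{p}_i(w^*)$$

$$= \sum_{i=1}^n \alpha_i f_i(x_i) - \frac{\lambda}{2} \sum_{i=1}^n \alpha_i \Big\| x_i - \sum_{j=1}^n \alpha_j x_j \Big\|^2 - \tilde{P}_n(w^*) \tag{17}$$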

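where $\tilde{P}_n(\cdot) : \Gamma_f \to \mathbb{R}$ is defined by:

$$\tilde{P}_n(\cdot) = \sum_{i=1}^n \beta_i \tilde{p}_i(\cdot) \tag{18}$$

and $A = [\alpha_1, \ldots, \alpha_n]^T$, $B = [\beta_1, \ldots, \beta_n]^T$ are constant $n$-dimensional vectors, with each $\alpha_i$, $\beta_i$ satisfying:

$$\sum_{i=1}^n \alpha_i = 1, \quad \alpha_i \geq 0 \ \forall i \tag{19a}$$

$$\sum_{i=1}^n \beta_i = 1, \quad \beta_i \geq 0 \ \forall i \tag{19b}$$

Accordingly $y(f_{1,\ldots,n}, x_1, \ldots, x_n)$ can be set to:

$$y_{(A,B)}(x_1, \ldots, x_n) = \sum_{i=1}^n \alpha_i x_i \tag{20}$$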

such that (12a) is guaranteed by Lemma 2. It should be noted that $A$ and $B$ must both be constant vectors that are independent
of all stochastic variables, otherwise the convexity condition (12b) may be lost. For example, if we always set $\beta_i$ as the
solution of the following problem:

$$\beta_i = \arg\max_{\sum_{i=1}^n \beta_i = 1,\ \beta_i \geq 0 \ \forall i} \left\{ \sum_{i=1}^n \beta_i \tilde{p}_i(w^*) \right\}$$

then $\tilde{P}_n(w^*)$ will be no different from the cutting-plane bound (6). Finally, (12c) can be validated directly by substituting (17)
back into (12c):

$$\langle \nabla_{f,\ldots,f} \tilde{U}_{(A,B)}(f, \ldots, f, x_{1,\ldots,n}), [\delta_{1,\ldots,n}]^T \rangle = \sum_{i=1}^n \big[ (\alpha_i - \beta_i)\, \delta_i(x_i) - \langle \nabla \delta_i(x_i), \beta_i (w^* - x_i) \rangle \big] \tag{21}$$

Clearly $\mathbb{E}[(\alpha_i - \beta_i)\, \delta_i(x_i)] = 0$ and $\mathbb{E}[\langle \nabla \delta_i(x_i), w^* \rangle] = 0$ can be easily satisfied because $\alpha_i$ and $\beta_i$ are already set to constants
to enforce (12b), and by definition $w^* = \arg\min f(\cdot)$ is a deterministic (yet unknown) variable in our problem setting, while
both $\delta_i(x_i)$ and $\nabla \delta_i(x_i)$ are unbiased according to (3a). Bounding $\mathbb{E}[\langle \nabla \delta_i(x_i), x_i \rangle]$ is a bit harder but still possible: in all
optimization algorithms, each $x_i$ can either be a constant, or chosen from $\Gamma_f$ based on previous $f_1(\cdot), \ldots, f_{i-1}(\cdot)$ ($x_i$ cannot
be based on $f_i(\cdot)$, which is still unknown by the time $x_i$ is chosen). By the i.i.d. condition (3b), they are all independent of
$f_i(\cdot)$, which implies that $x_i$ is also independent of $f_i(x_i)$:

$$\mathbb{E}[\langle \nabla \delta_i(x_i), x_i \rangle] = 0 \tag{22}$$

As a result, we conclude that (21) satisfies $\mathbb{E}[\langle \nabla_{f,\ldots,f} \tilde{U}_{(A,B)}(f, \ldots, f, x_{1,\ldots,n}), [\delta_{1,\ldots,n}]^T \rangle] = 0$, and subsequently $\tilde{U}_{(A,B)}$
defined by (17) satisfies all three conditions of Lemma 3. At this point we may construct an algorithm that uniformly reduces
$\max_{w^*} \tilde{U}_{(A,B)}$ by iteratively calling new stochastic oracles and updating $A$ and $B$. Our main result is summarized in the
following theorem:

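Theorem 1. For all $\lambda$-strongly convex functions $f(\cdot)$, assuming that at some stage of an algorithm, $n$ stochastic oracles
$\tilde{\omega}(x_1), \ldots, \tilde{\omega}(x_n)$ have been called to yield a point $y_{(A^n, B^n)}$ defined by (20) and an upper bound $\hat{U}_{(A^n, B^n)}$ defined by:

$$\hat{U}_{(A^n, B^n)}(f_{1,\ldots,n}, x_{1,\ldots,n}) = \tilde{U}_{(A^n, B^n)}(f_{1,\ldots,n}, x_{1,\ldots,n}) + \frac{\lambda}{2} \sum_{i=1}^n \alpha_i^n \Big\| x_i - \sum_{j=1}^n \alpha_j^n x_j \Big\|^2 + \Big( \tilde{P}_n(w^*) - \min \tilde{P}_n(\cdot) \Big) \tag{23}$$

where $A^n$ and $B^n$ are constant vectors satisfying (19) (here the $n$ in $A^n$ and $B^n$ denotes a superscript and should not be confused with
an exponent), if the algorithm calls another stochastic oracle $\tilde{\omega}(x_{n+1})$ at a new point $x_{n+1}$ given by:

$$x_{n+1} = \arg\min \tilde{P}_n(\cdot) \tag{24a}$$

and updates $A$ and $B$ by:

$$\alpha_{n+1}^{n+1} = \beta_{n+1}^{n+1} = \frac{\lambda}{G^2} \hat{U}_{(A^n, B^n)} \tag{24b}$$

where $G = \max \|\nabla f_i(\cdot)\|$, then $\hat{U}_{(A^{n+1}, B^{n+1})}$ is bounded by:

$$\hat{U}_{(A^{n+1}, B^{n+1})} \leq \hat{U}_{(A^n, B^n)} - \frac{\lambda}{2 G^2} \hat{U}_{(A^n, B^n)}^2 \tag{25}$$

Proof:

First, optimizing and caching all elements of $A^n$ or $B^n$ takes at least $O(n)$ time and space, which is not possible in large
scale problems. So we confine our options for $A^{n+1}$ and $B^{n+1}$ to setting: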



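$$[\alpha_1^{n+1}, \ldots, \alpha_n^{n+1}]^T = (1 - \alpha_{n+1}^{n+1}) [\alpha_1^n, \ldots, \alpha_n^n]^T \tag{26a}$$

$$[\beta_1^{n+1}, \ldots, \beta_n^{n+1}]^T = (1 - \beta_{n+1}^{n+1}) [\beta_1^n, \ldots, \beta_n^n]^T \tag{26b}$$

such that the previous $\sum_{i=1}^n \alpha_i f_i(x_i)$ and $\sum_{i=1}^n \beta_i \tilde{p}_i(\cdot)$ can be summed up in previous iterations in order to produce a 1-memory
algorithm instead of an $\infty$-memory one, without violating (19). Consequently $\hat{U}_{(A^{n+1}, B^{n+1})}$ can be decomposed into: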



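$$\hat{U}_{(A^{n+1}, B^{n+1})} = \sum_{i=1}^{n+1} \alpha_i^{n+1} f_i(x_i) - \min \tilde{P}_{n+1}(\cdot)$$

$$\text{(by strong convexity of } \tilde{P}_{n+1}\text{)} \quad \leq \sum_{i=1}^{n+1} \alpha_i^{n+1} f_i(x_i) - \tilde{P}_{n+1}(\hat{x}) + \frac{1}{2\lambda} \|\nabla \tilde{P}_{n+1}(\hat{x})\|^2$$

$$\text{(by (26))} \quad = \Big[ (1 - \alpha_{n+1}^{n+1}) \sum_{i=1}^{n} \alpha_i^{n} f_i(x_i) - (1 - \beta_{n+1}^{n+1}) \tilde{P}_n(\hat{x}) \Big] + \Big[ \alpha_{n+1}^{n+1} f_{n+1}(x_{n+1}) - \beta_{n+1}^{n+1} \tilde{p}_{n+1}(\hat{x}) \Big] + \frac{1}{2\lambda} \|\nabla \tilde{P}_{n+1}(\hat{x})\|^2$$

where $\hat{x} = \arg\min \tilde{P}_n(\cdot)$. Setting $\alpha_{n+1}^{n+1} = \beta_{n+1}^{n+1}$ and $x_{n+1} = \hat{x}$ eliminates the second term:

$$\hat{U}_{(A^{n+1}, B^{n+1})} \leq (1 - \alpha_{n+1}^{n+1}) \Big( \sum_{i=1}^{n} \alpha_i^{n} f_i(x_i) - \tilde{P}_n(x_{n+1}) \Big) + \frac{1}{2\lambda} \|\nabla \tilde{P}_{n+1}(x_{n+1})\|^2$$

$$\text{(by (23))} \quad = (1 - \alpha_{n+1}^{n+1})\, \hat{U}_{(A^n, B^n)} + \frac{(\alpha_{n+1}^{n+1})^2}{2\lambda} \|\nabla f_{n+1}(x_{n+1})\|^2$$

$$(G \geq \|\nabla f_i(\cdot)\|) \quad \leq (1 - \alpha_{n+1}^{n+1})\, \hat{U}_{(A^n, B^n)} + \frac{(\alpha_{n+1}^{n+1})^2}{2\lambda} G^2 \tag{27}$$

Let $u_i = \frac{2\lambda}{G^2} \hat{U}_{(A^i, B^i)}$; minimizing the right side of (27) over $\alpha_{n+1}^{n+1}$ yields:

$$\alpha_{n+1}^{n+1} = \arg\min_{\alpha} \{ (1 - \alpha) u_n + \alpha^2 \} = \frac{u_n}{2} = \frac{\lambda}{G^2} \hat{U}_{(A^n, B^n)} \tag{28}$$

In this case:

$$u_{n+1} \leq u_n - \frac{u_n^2}{4} \tag{29}$$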

■ 

Given an arbitrary initial oracle $\tilde{\omega}(x_1)$, applying the updating rule in Theorem 1 recursively results in Algorithm 1. Accordingly,
we can prove its asymptotic behavior by induction:

Corollary 1. The final point $y_n$ obtained by applying Algorithm 1 on an arbitrary $\lambda$-strongly convex function $f(\cdot)$ has the following
worst-case rate of convergence:

$$\mathbb{E}[f(y_n)] - \min f(\cdot) \leq \frac{2 G^2}{\lambda (n + 3)}$$

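Proof:
First, by (29) we have:

$$\frac{1}{u_{n+1}} \geq \frac{1}{u_n \left(1 - \frac{u_n}{4}\right)} = \frac{1}{u_n} + \frac{1}{4 - u_n} \geq \frac{1}{u_n} + \frac{1}{4} \tag{30}$$

On the other hand, by strong convexity, for all $x_1 \in \Gamma_f$ we have:

$$f(x_1) - \min f(\cdot) \leq \frac{1}{2\lambda} \|\nabla f(x_1)\|^2 \leq \frac{G^2}{2\lambda} \tag{31}$$

Setting $\hat{U}_{(A^1, B^1)} = \frac{G^2}{2\lambda}$ (that is, $u_1 = 1$) as the initial condition and applying (30) recursively induces the following generative function:

$$\frac{1}{u_n} \geq \frac{1}{u_1} + \frac{n - 1}{4} = \frac{n + 3}{4}$$

$$\Longrightarrow \quad u_n \leq \frac{4}{n + 3}$$

$$\Longrightarrow \quad \hat{U}_{(A^n, B^n)} \leq \frac{2 G^2}{\lambda (n + 3)}$$

$$\Longrightarrow \quad \mathbb{E}[f(y_n)] - \min f(\cdot) \leq \frac{2 G^2}{\lambda (n + 3)} - \frac{\lambda}{2} \sum_{i=1}^n \alpha_i^n \|x_i - y_n\|^2 - \Big( \tilde{P}_n(w^*) - \min \tilde{P}_n(\cdot) \Big)$$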

■ 
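The induction in the proof can be replayed numerically. The sketch below iterates the worst case of the recursion, $u_{k+1} = u_k - u_k^2/4$ with $u_1 = 1$, checks that it stays below the envelope $4/(k+3)$, and confirms that $\alpha = u/2$ is the minimizer of the per-step objective $(1 - \alpha) u + \alpha^2$. The function names and the sample values of $G$ and $\lambda$ are illustrative assumptions.

```python
def u_sequence(n):
    """u_1 = 1 and u_{k+1} = u_k - u_k^2 / 4 (worst case of the recursion)."""
    u = [1.0]
    for _ in range(n - 1):
        u.append(u[-1] - u[-1] ** 2 / 4.0)
    return u

def worst_case_bound(n, G, lam):
    """Bound 2 G^2 / (lam * (n + 3)) on the expected suboptimality after n oracle calls."""
    return 2.0 * G * G / (lam * (n + 3))

def step_objective(alpha, u):
    # normalized right-hand side of the one-step bound as a function of the step weight
    return (1.0 - alpha) * u + alpha * alpha

u = u_sequence(1000)
envelope_ok = all(uk <= 4.0 / (k + 3) + 1e-12 for k, uk in enumerate(u, start=1))
# alpha = u/2 should be no worse than nearby step weights
best_is_half_u = all(
    step_objective(uk / 2, uk) <= min(step_objective(uk / 2 + d, uk) for d in (-0.05, 0.05))
    for uk in u[:10]
)
bound_at_1000 = worst_case_bound(1000, G=1.0, lam=0.1)
```

Since the objective is a parabola in $\alpha$, moving away from $u/2$ by $d$ raises it by exactly $d^2$, which is what the second check observes.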

This worst-case rate of convergence is four times faster than Epoch-GD (^-) lE], llSj or Cutting-plane/Bundle Method 
^^-ttt^TTYt) 101' M, 0, and is indefinitely faster than SGD (^^M^!) |[T], Q. 



A n+2- 



= 2 I 4G^ 



IV. High Probability Bound 
An immediate result of Corollary 1 is the following high probability bound yielded by Markov's inequality:

$$\Pr\left( S(y_n) \geq \frac{2 G^2}{\lambda (n + 3) \eta} \right) \leq \eta \tag{32}$$

where $1 - \eta \in [0, 1]$ denotes the confidence of the result $y_n$ to reach the desired suboptimality. In most cases (particularly
when $\eta \approx 0$, as demanded by most applications) this bound is very loose and cannot demonstrate the true performance of the
proposed algorithm. In this section we derive several high probability bounds that are much less sensitive to small $\eta$ compared
to (32).
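The looseness of the Markov bound for small $\eta$ is easy to see numerically: the guaranteed suboptimality level scales as $1/\eta$, so each extra digit of confidence costs ten times more oracle calls. The sketch below is a direct transcription of (32); the function names and the sample constants are hypothetical.

```python
import math

def markov_threshold(n, G, lam, eta):
    """Suboptimality level exceeded with probability at most eta, by Markov's inequality."""
    return 2.0 * G * G / (lam * (n + 3) * eta)

def iterations_for(target, G, lam, eta):
    """Smallest n (up to rounding) whose Markov threshold does not exceed target."""
    return max(1, math.ceil(2.0 * G * G / (lam * eta * target) - 3))

# Same n, decreasing failure probability: the guaranteed level degrades like 1/eta
levels = [markov_threshold(1000, G=1.0, lam=1.0, eta=e) for e in (0.1, 0.01, 0.001)]
n_needed = iterations_for(target=0.01, G=1.0, lam=1.0, eta=0.05)
```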

Corollary 2. The final point $y_n$ obtained by applying Algorithm 1 on an arbitrary $\lambda$-strongly convex function $f(\cdot)$ has the following
high probability bounds:

where constants $\tilde{G} = \max \|\nabla \delta_i(\cdot)\|$ and $\sigma^2 = \max \mathrm{Var}(\nabla f_i(\cdot))$ are the maximal range and variance of each stochastic subgradient
respectively, and $D = \max_{x_1, x_2 \in \Gamma_f} \|x_1 - x_2\|$ is the largest distance between two points in $\Gamma_f$.

Proof: 

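We start by expanding the right side of (14); setting $A^n = B^n$ and substituting (21) back into (14) yields:

$$S(y_n) \leq \tilde{U}_{(A^n, A^n)}(f_{1,\ldots,n}, x_{1,\ldots,n}) - \sum_{i=1}^n \alpha_i^n \langle \nabla \delta_i(x_i), x_i - w^* \rangle$$

$$\text{(by Corollary 1)} \quad \leq \frac{2 G^2}{\lambda (n + 3)} + \sum_{i=1}^n \alpha_i^n r_i \tag{34}$$

with each $r_i = -\langle \nabla \delta_i(x_i), x_i - w^* \rangle$ satisfying:

$$\text{(Cauchy's inequality)} \quad -\|\nabla \delta_i(x_i)\| \|x_i - w^*\| \leq r_i \leq \|\nabla \delta_i(x_i)\| \|x_i - w^*\|$$

$$-\tilde{G} D \leq r_i \leq \tilde{G} D \tag{35}$$

$$\mathrm{Var}(r_i) = \mathbb{E}\big[ (\langle \nabla \delta_i(x_i), x_i - w^* \rangle - \mathbb{E}[\langle \nabla \delta_i(x_i), x_i - w^* \rangle])^2 \big]$$

$$\text{(by (22))} \quad = \mathbb{E}\big[ (\langle \nabla \delta_i(x_i), x_i - w^* \rangle)^2 \big]$$

$$\text{(Cauchy's inequality)} \quad \leq \mathbb{E}\big[ \|\nabla \delta_i(x_i)\|^2 \|x_i - w^*\|^2 \big]$$

$$\leq D^2\, \mathbb{E}\big[ \|\nabla \delta_i(x_i)\|^2 \big] = D^2\, \mathrm{Var}(\nabla f_i(x_i)) \leq D^2 \sigma^2 \tag{36}$$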



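This immediately exposes $S(y_{(A^n, A^n)}(x_{1,\ldots,n}))$ to several inequalities in non-parametric statistics that bound the probability
of a sum of independent random variables:

$$\text{(generalized Chernoff bound)} \quad \Pr\left( \sum_{i=1}^n \alpha_i^n r_i \geq t \right) \leq \exp\left\{ -\frac{t^2}{4\, \mathrm{Var}\left( \sum_{i=1}^n \alpha_i^n r_i \right)} \right\}$$

$$\text{(by (36))} \quad \leq \exp\left\{ -\frac{t^2}{4 D^2 \sigma^2 \sum_{i=1}^n (\alpha_i^n)^2} \right\} \tag{37a}$$

$$\text{(Azuma-Hoeffding inequality)} \quad \Pr\left( \sum_{i=1}^n \alpha_i^n r_i \geq t \right) \leq \frac{1}{2} \exp\left\{ -\frac{2 t^2}{\sum_{i=1}^n (\alpha_i^n)^2 (\max r_i - \min r_i)^2} \right\}$$

$$\text{(by (35))} \quad \leq \frac{1}{2} \exp\left\{ -\frac{2 t^2}{4 \tilde{G}^2 D^2 \sum_{i=1}^n (\alpha_i^n)^2} \right\} \tag{37b}$$

$$\text{(Bennett inequality)} \quad \Pr\left( \sum_{i=1}^n \alpha_i^n r_i \geq t \right) \leq \exp\left\{ -\frac{t}{2 \max \|\alpha_i^n r_i\|} \ln\left( 1 + \frac{t \max \|\alpha_i^n r_i\|}{\mathrm{Var}\left( \sum_{i=1}^n \alpha_i^n r_i \right)} \right) \right\}$$

$$\text{(by (35), (36))} \quad \leq \exp\left\{ -\frac{t}{2 \tilde{G} D \max \alpha_i^n} \ln\left( 1 + \frac{t\, \tilde{G} \max \alpha_i^n}{D \sigma^2 \sum_{i=1}^n (\alpha_i^n)^2} \right) \right\} \tag{37c}$$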

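In the case of Algorithm 1, if $A^n$ is recursively updated by (24), then each two consecutive $\alpha_i^n$ have the following property (for $i \geq 3$; for $i = 2$ the ratio equals 1, since $\alpha_1^1 = 1$ and $\alpha_2^2 = \frac{1}{2}$):

$$\frac{\alpha_i^n}{\alpha_{i-1}^n} \overset{\text{(by (26))}}{=} \frac{\alpha_i^i}{(1 - \alpha_i^i)\, \alpha_{i-1}^{i-1}} \overset{\text{(by (28), (29))}}{=} \frac{u_{i-1}}{\left(1 - \frac{u_{i-1}}{2}\right) u_{i-2}} = \frac{1 - \frac{u_{i-2}}{4}}{1 - \frac{u_{i-1}}{2}} \overset{(u_{i-2} \leq 1)}{>} 1$$

$$\Longrightarrow \quad \alpha_i^n \geq \alpha_{i-1}^n, \qquad \max_i \alpha_i^n = \alpha_n^n = \frac{u_{n-1}}{2} \leq \frac{2}{n + 2} \tag{38}$$

Accordingly, for $n \geq 2$, $\sum_{i=1}^n (\alpha_i^n)^2$ can be bounded by:

$$\sum_{i=1}^n (\alpha_i^n)^2 \leq n\, (\alpha_n^n)^2 \leq \frac{4 n}{(n + 2)^2} \tag{39}$$

Eventually, combining (34), (37), (38) and (39) together yields the proposed high probability bounds (33).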

■ 
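The behavior of the weights can be checked directly. The sketch below rebuilds the final weight vector by the rescaling rule (26) with step weight $u/2$ as in (28), using the worst-case recursion $u \leftarrow u - u^2/4$, and tests three properties: the weights remain a convex combination, they are nondecreasing in the index, and their sum of squares stays below $4n/(n+2)^2$ for $n \geq 2$. Reading the garbled bound as $4n/(n+2)^2$ is an assumption of this example, as is the function name `final_weights`.

```python
def final_weights(n):
    """Weights alpha_1^n, ..., alpha_n^n: rescale old weights by (1 - a), append a = u/2."""
    weights, u = [1.0], 1.0
    for _ in range(n - 1):
        a = u / 2.0
        weights = [w * (1.0 - a) for w in weights] + [a]
        u -= u * u / 4.0
    return weights

checks = []
for n in (2, 5, 50, 500):
    w = final_weights(n)
    checks.append((
        abs(sum(w) - 1.0) < 1e-9,                                   # convex combination
        all(b >= a - 1e-15 for a, b in zip(w, w[1:])),              # nondecreasing in i
        sum(x * x for x in w) <= 4.0 * n / (n + 2) ** 2 + 1e-12,    # sum-of-squares bound
    ))
```

Later oracle calls therefore carry the largest weight, and the sum of squares decays like $O(1/n)$, which is what makes the exponential tail bounds tighten as $n$ grows.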

By definition $\tilde{G}$ and $\sigma$ are both upper bounded by $G$. And if $\Gamma_f$ is undefined, by combining the strong convexity condition
$B_f(x_1 \,\|\, \arg\min f(\cdot)) = f(x_1) - \min f(\cdot) \geq \frac{\lambda}{2} \|x_1 - \arg\min f(\cdot)\|^2$ and (31) together we can still set

$$\Gamma_f = \left\{ x : \|x - x_1\|^2 \leq \frac{G^2}{\lambda^2} \right\}$$

such that $D = \frac{2G}{\lambda}$ while $\arg\min f(\cdot)$ is always included in $\Gamma_f$. Consequently, even in the worst cases (32) can be easily superseded
by any of (33), in which $\eta$ decreases exponentially with $t$ instead of inverse proportionally. In most applications both $\tilde{G}$ and $\sigma$
can be much smaller than $G$, and $\sigma$ can be further reduced if each $\tilde{\omega}(x_i)$ is estimated by averaging over several stochastic
oracles provided simultaneously by a parallel/distributed system.

V. Discussion 

In this article we proposed Algorithm 1, a first-order algorithm for stochastic strongly convex optimization that asymptotically
outperforms all state-of-the-art algorithms by four times, achieving less than $S$ suboptimality using only $\frac{2 G^2}{\lambda S} - 3$ iterations and
stochastic oracles on average. Theoretically, Algorithm 1 can be generalized to strongly convex functions w.r.t. arbitrary norms
using the technique proposed in Q, and a slightly different analysis can be used to find optimal methods for strongly smooth
(a.k.a. gradient Lipschitz continuous or g.l.c.) functions, but we leave these to further investigation. We do not know if this
algorithm is optimal and unimprovable, nor do we know if higher-order algorithms can be discovered using a similar analysis.
There are several loose ends that we may have failed to scrutinize. The most likely one is that we assume:

$$\max_f S(y) = \max_f \{ f(y) - \min f(\cdot) \} \leq \max_f f(y) - \min_f \min f(\cdot)$$

However, there is in fact no case in which $\arg\max_f f(y) = \arg\min_f \min f(\cdot) \ \forall y \in \Gamma_f$, so this bound is still far from unimprovable.
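Another possible one is that we do not know how to bound $\frac{\lambda}{2} \sum_{i=1}^n \alpha_i^n \|x_i - y_n\|^2$ by optimizing $x_n$ and $\alpha^n$, so it is isolated
from (23) and never participates in the parameter optimization of (27), but actually it can be decomposed into:

$$\frac{\lambda}{2} \sum_{i=1}^{n+1} \alpha_i^{n+1} \|x_i - y_{n+1}\|^2 = \frac{\lambda}{2} \sum_{i=1}^{n+1} \alpha_i^{n+1} \|x_i - y_n\|^2 - \frac{\lambda}{2} \|y_{n+1} - y_n\|^2$$

$$= \frac{\lambda}{2} \left\{ (1 - \alpha_{n+1}^{n+1}) \sum_{i=1}^{n} \alpha_i^{n} \|x_i - y_n\|^2 + \alpha_{n+1}^{n+1} \|x_{n+1} - y_n\|^2 \right\} - \frac{\lambda}{2} (\alpha_{n+1}^{n+1})^2 \|y_n - x_{n+1}\|^2$$

$$= (1 - \alpha_{n+1}^{n+1}) \frac{\lambda}{2} \sum_{i=1}^{n} \alpha_i^{n} \|x_i - y_n\|^2 + \frac{\lambda}{2} \alpha_{n+1}^{n+1} (1 - \alpha_{n+1}^{n+1}) \|y_n - x_{n+1}\|^2$$

such that $\frac{\lambda}{2} \alpha_{n+1}^{n+1} (1 - \alpha_{n+1}^{n+1}) \|y_n - x_{n+1}\|^2$ can be added into the right side of (27). Unfortunately, we still do not know how
to bound it, but it may prove useful in some alternative problem settings (e.g. in optimization of strongly smooth
functions).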
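The decomposition of the weighted spread is the usual variance identity for a two-component mixture and holds for any points and weights. As a sanity check, the sketch below compares both sides on random two-dimensional data (without the common factor $\lambda/2$). The helper names, the random data and the mixing weight `a` are illustrative assumptions.

```python
import random

def sq_dist(p, q):
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q))

def mean(points, weights):
    dim = len(points[0])
    return [sum(w * p[k] for w, p in zip(weights, points)) for k in range(dim)]

def spread(points, weights):
    """Weighted sum of squared distances to the weighted mean."""
    y = mean(points, weights)
    return sum(w * sq_dist(p, y) for w, p in zip(weights, points))

rng = random.Random(2)
points = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(5)]
raw = [rng.random() for _ in points]
weights = [r / sum(raw) for r in raw]
x_new, a = [rng.gauss(0, 1), rng.gauss(0, 1)], 0.3   # new point and its weight

y_old = mean(points, weights)
# left side: spread after rescaling old weights by (1 - a) and appending the new point
lhs = spread(points + [x_new], [(1 - a) * w for w in weights] + [a])
# right side: shrunken old spread plus a * (1 - a) * ||y_old - x_new||^2
rhs = (1 - a) * spread(points, weights) + a * (1 - a) * sq_dist(x_new, y_old)
```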

Most importantly, if $f(\cdot)$ is $\lambda$-strongly convex and each $f_i(\cdot)$ can be revealed completely by each oracle (instead of only its
first-order information), then the principle of empirical risk minimization (ERM)
easily outperforms all state-of-the-art stochastic methods by yielding the best-ever rate of convergence [7], and is still
more than four times faster than Algorithm 1 (though this is already very close for a first-order method). This immediately
raises the question: how do we close this gap? And if first-order methods are not able to do so, how much extra information
about each $f_i(\cdot)$ is required to reduce it? We believe that solutions to these long-term problems are vital in the construction of very
large scale predictors in computational learning, but we are still far from obtaining any of them.